Abstract

This paper presents a general architecture and four algorithms that
use Natural Language Processing for automatic ontology matching. The
proposed approach is purely instance based, i.e., only the instance
documents associated with the nodes of ontologies are taken into
account.

The  four  algorithms  have  been  evaluated 
using  real  world  test  data,  taken  from  the 
Google  and  LookSmart  online  directories. 

The results show that NLP techniques applied to instance documents
help the system achieve higher performance.

1  Introduction 

Many fundamental issues about the viability and exploitation of the
web as a linguistic corpus have not been tackled yet. The web is a
massive repository of text and multimedia data. However, there is no
systematic way of classifying and retrieving these documents.
Computational linguists are of course not the only ones looking at
these issues; research on the Semantic Web focuses on providing a
semantic description of all the resources on the web, resulting in a
mesh of information linked up in such a way as to be easily
processable by machines, on a global scale. You can think of it as
being an efficient way of representing data on the World Wide Web, or
as a globally linked database.1 The vision of the Semantic Web is to
be achieved by describing each document using languages such as RDF
Schema and OWL, which are capable of explicitly expressing the
meaning of terms in vocabularies and the relationships between those
terms.

1 http://infomesh.net/2001/swintro/


The issue we are focusing on in this paper is that these languages
are used to define ontologies as well. If ultimately a single
ontology were used to describe all the documents on the web, systems
would be able to exchange information in a way that is transparent to
the end user. The availability of such a standard ontology would be
extremely helpful to NLP as well, e.g., it would make it far easier
to retrieve all documents on a certain topic. However, until this
vision becomes a reality, a plurality of ontologies is being used to
describe documents and their content. The task of automatic ontology
alignment or matching (Hughes and Ashpole, 2005) then needs to be
addressed.

The task of ontology matching has typically been carried out manually
or semi-automatically, for example through the use of graphical user
interfaces (Noy and Musen, 2000). Previous work has been done to
provide automated support for this time-consuming task (Rahm and
Bernstein, 2001; Cruz and Rajendran, 2003; Doan et al., 2003; Cruz et
al., 2004; Subba and Masud, 2004). The various methods can be
classified into two main categories: schema based and instance based.
Schema based approaches try to infer the semantic mappings by
exploiting information related to the structure of the ontologies to
be matched, such as their topological properties, the labels or
descriptions of their nodes, and structural constraints defined on
the schemas of the ontologies. These methods do not take into account
the actual data classified by the ontologies. Instance based
approaches, on the other hand, look at the information contained in
the instances of each element of the schema. These methods try to
infer the relationships between the nodes of the ontologies from the
analysis of their instances. Finally, hybrid approaches combine
schema and instance based methods into integrated systems.

Neither instance level information nor NLP techniques have been
extensively explored in previous work on ontology matching. For
example, (Agirre et al., 2000) exploits documents (instances) on the
WWW to enrich WordNet (Miller et al., 1990), i.e., to compute
“concept signatures,” collections of words that significantly
distinguish one sense from another, but not directly for ontology
matching. (Liu et al., 2005) uses documents retrieved via queries
augmented with, for example, synonyms that WordNet provides to
improve the accuracy of the queries themselves, but not for ontology
matching. NLP techniques such as POS tagging or parsing have been
used for ontology matching, but on the names and definitions in the
ontology itself, for example in (Hovy, 2002), hence with a schema
based methodology.

In this paper, we describe the results we obtained when using some
simple but effective NLP methods to align web ontologies, using an
instance based approach. As we will see, our results show that more
sophisticated methods do not necessarily lead to better results.

2  General  architecture 

The instance based approach we propose uses NLP techniques to compute
matching scores based on the documents classified under the nodes of
ontologies. There is no assumption on the structural properties of
the ontologies to be compared: they can be any kind of graph
representable in OWL. The instance documents are assumed to be text
documents (plain text or HTML).

The matching process starts from a pair of ontologies to be aligned.
The two ontologies are traversed and, for each node having at least
one instance, the system computes a signature based on the instance
documents. Then, the signatures associated with the nodes of the two
ontologies are compared pairwise, and a similarity score for each
pair is generated. This score can then be used to estimate the
likelihood of a match between a pair of nodes, under the assumption
that the semantics of a node corresponds to the semantics of the
instance documents classified under that node. Figure 1 shows the
architecture of our system.

The two main issues to be addressed are (1) the representation of
signatures and (2) the definition of a suitable comparison metric
between signatures. For a long time, the Information Retrieval
community has successfully adopted a “bag of words” approach to
effectively represent and compare text documents. We start from there
to define a general signature structure and a metric to compare
signatures.

A signature is defined as a function S : K → R+, mapping a finite set
of keys (which can be complex objects) to positive real values. With
a signature of that form, we can use the cosine similarity metric to
score the similarity between two signatures:

$$\mathrm{simil}(S_1, S_2) = \frac{\sum_{p} S_1(k_p)\, S_2(k_p)}{\sqrt{\sum_{i} S_1(k_i)^2} \cdot \sqrt{\sum_{j} S_2(k_j)^2}}, \qquad k_p \in K_1 \cap K_2,\; k_i \in K_1,\; k_j \in K_2$$

The cosine similarity formula produces a value in the range [0, 1].
The meaning of that value depends on the algorithm used to build the
signature. In particular, there is no predefined threshold that can
be used to discriminate matches from non-matches. However, such a
threshold could be computed a posteriori from a statistical analysis
of experimental results.
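As an illustration only, the metric can be sketched in Python under the assumption that a signature is stored as a dictionary from keys to positive weights. The function name and the sample signatures below are hypothetical and are not taken from the system described in this paper, which was written in Java.

```python
import math


def cosine_similarity(s1, s2):
    """Cosine similarity between two signatures (dict: key -> positive weight)."""
    # Only keys present in both signatures contribute to the numerator.
    shared = s1.keys() & s2.keys()
    dot = sum(s1[k] * s2[k] for k in shared)
    # Each norm is taken over all the keys of the respective signature.
    norm1 = math.sqrt(sum(v * v for v in s1.values()))
    norm2 = math.sqrt(sum(v * v for v in s2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen here for empty signatures
    return dot / (norm1 * norm2)


# Hypothetical signatures for two nodes.
source = {"museum": 3.0, "art": 1.0}
target = {"museum": 3.0, "art": 1.0, "ticket": 2.0}
print(cosine_similarity(source, target))
```

Because all weights are positive, the score stays within [0, 1]; signatures with no key in common score 0.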

2.1  Signature  generation  algorithms 

For our experiments, we defined and implemented four algorithms to
generate signatures. The four algorithms make use of text and
language processing techniques of increasing complexity.

2.1.1  Algorithm  1:  Baseline  signature 

The baseline algorithm performs a very simple sequence of text
processing steps, schematically represented in Figure 2.


Figure 3: Noun signature creation

HTML tags are first removed from the instance documents. Then, the
texts are tokenized and punctuation is removed. Everything is then
converted to lowercase. Finally, the tokens are grouped and counted.
The final signature has the form of a mapping table token →
frequency.

The main problem we expected with this method is the presence of a
lot of noise. In fact, many “irrelevant” words, like determiners,
prepositions, and so on, are added to the final signature.
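The baseline steps can be sketched roughly as follows. This is a minimal Python illustration, not the Java implementation used in the experiments; the regular expressions used for tag stripping and tokenization are simplifying assumptions, and the function name is hypothetical.

```python
import re
from collections import Counter

# Assumption: a crude regex is enough to strip tags in this sketch;
# a real system would more likely use a proper HTML parser.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: a token is a run of letters or digits, so punctuation
# is dropped as a side effect of tokenization.
TOKEN_RE = re.compile(r"[^\W_]+")


def baseline_signature(documents):
    """Build a token -> frequency signature from a node's instance documents."""
    counts = Counter()
    for doc in documents:
        text = TAG_RE.sub(" ", doc)                # remove HTML tags
        tokens = TOKEN_RE.findall(text)            # tokenize, drop punctuation
        counts.update(t.lower() for t in tokens)   # lowercase, group, count
    return dict(counts)


print(baseline_signature(["<p>The Museum, the museum!</p>"]))
```

Function words such as "the" end up in the signature with high counts, which is the kind of noise discussed above.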

2.1.2  Algorithm  2:  Noun  signature 

To cope with the problem of excessive noise, people in IR often use
fixed lists of stop words to be removed from the texts. Instead, we
introduced a syntax based filter in our processing chain. The main
assumption is that nouns are the words that carry most of the meaning
for our kind of document comparison. Thus, we introduced a
part-of-speech tagger right after the tokenization module (Figure 3).
The results of the tagger are used to discard everything but nouns
from the input documents. The part-of-speech tagger we used, QTAG 3.1
(Tufis and Mason, 1998), readily available on the web as a Java
library, is a Hidden Markov Model based statistical tagger.
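The noun filter can be illustrated with the sketch below, which takes the tagger as a parameter. Everything in it is an assumption made for illustration: the toy dictionary tagger stands in for a real statistical tagger such as QTAG, and the convention that noun tags start with "NN" is borrowed from Penn Treebank style tagsets rather than from the tagset actually used in the experiments.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for a real part-of-speech tagger.
TOY_LEXICON = {
    "the": "DT", "apple": "NN", "apples": "NNS",
    "fell": "VBD", "from": "IN", "tree": "NN",
}


def toy_tagger(tokens):
    """Return (token, tag) pairs; unknown words get the tag 'UNK'."""
    return [(t, TOY_LEXICON.get(t.lower(), "UNK")) for t in tokens]


def noun_signature(tokens, tagger):
    """Keep only the tokens tagged as nouns and count them."""
    tagged = tagger(tokens)
    # Assumption: noun tags start with "NN" (Penn Treebank convention).
    nouns = (tok.lower() for tok, tag in tagged if tag.startswith("NN"))
    return dict(Counter(nouns))


print(noun_signature("The apple fell from the tree".split(), toy_tagger))
```

Tagging runs on the original tokens and lowercasing happens afterwards, mirroring the position of the tagger right after tokenization.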

The problems we expected with this approach are related to the high
specialization of words in natural language. Different nouns can bear
similar meanings, but our system would treat them as if they were
completely unrelated words. For example, the words “apple” and
“orange” are semantically closer than “apple” and “chair,” but a
purely syntactic approach would not make any distinction between
these two pairs. Also, the current method does not include
morphological processing, so different inflections of the same word,
such as “apple” and “apples,” are treated as distinct words.

In further experiments, we also considered verbs, another syntactic
category of words bearing a lot of semantics in natural language. We
computed signatures with verbs only, and with verbs and nouns
together. In both cases, however, the performance of the system was
worse. Thus, we will not consider verbs in the rest of the paper.

Figure 4: WordNet signature creation

2.1.3  Algorithm  3:  WordNet  signature 

To address the limitations stated above, we used the WordNet lexical
resource (Miller et al., 1990). WordNet is a dictionary where words
are linked together by semantic relationships. In WordNet, words are
grouped into synsets, i.e., sets of synonyms. Each synset can have
links to other synsets. These links represent semantic relationships
like hypernymy, hyponymy, and so on.

In our approach, after the extraction of nouns and their grouping,
each noun is looked up in WordNet (Figure 4). The synsets to which
the noun belongs are added to the final signature in place of the
noun itself. The signature can also be enriched with the hypernyms of
these synsets, up to a specified level. The final signature has the
form of a mapping synset → value, where value is a weighted sum of
all the synsets found.

Two important parameters of this method are related to the hypernym
expansion process mentioned above. The first parameter is the maximum
level of hypernyms to be added to the signature (hypernym level). A
hypernym level value of 0 would make the algorithm add only the
synsets of a word, without any hypernym, to the signature. A value of
1 would cause the algorithm to also add their parents in the hypernym
hierarchy to the signature. With higher values, all the ancestors up
to the specified level are added. The second parameter, hypernym
factor, specifies the damping of the weight of the hypernyms in the
expansion process. Our algorithm exponentially dampens the hypernyms,
i.e., the weight of a hypernym decreases exponentially as its level
increases. The hypernym factor is the base of the exponential
function.

In general, a noun can have more than one sense, e.g., “apple” can be
either a fruit or a tree. This is reflected in WordNet by the fact
that a noun can belong to multiple synsets. With the current
approach, the system cannot decide which sense is the most
appropriate, so all the senses of a word are added to the final
signature, with a weight inversely proportional to the number of
possible senses of that word. This potentially introduces semantic
noise in the signature, because many irrelevant senses might be added
to the signature itself.

Figure 5: Disambiguated signature creation
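The following Python sketch illustrates one possible reading of the WordNet signature: each noun's count is split evenly among its senses, and a hypernym at level l receives that share divided by the hypernym factor raised to the power l. The exact weighting formula is an assumption, since the text only states that damping is exponential with the hypernym factor as base. The tiny lexicon and hierarchy are hypothetical stand-ins for WordNet.

```python
from collections import defaultdict


def wordnet_signature(noun_counts, synsets_of, hypernyms_of,
                      hypernym_level=0, hypernym_factor=2.0):
    """Build a synset -> weight signature from noun frequencies."""
    signature = defaultdict(float)
    for noun, count in noun_counts.items():
        senses = synsets_of.get(noun, [])
        if not senses:
            continue  # noun not covered by the lexicon: skipped
        # Weight inversely proportional to the number of senses.
        share = count / len(senses)
        for synset in senses:
            signature[synset] += share
            frontier = [synset]
            for level in range(1, hypernym_level + 1):
                # Move one step up the hypernym hierarchy.
                frontier = [p for s in frontier for p in hypernyms_of.get(s, [])]
                for parent in frontier:
                    # Assumed damping: divide by factor ** level.
                    signature[parent] += share / hypernym_factor ** level
    return dict(signature)


# Hypothetical miniature lexicon (synset names are made up).
SYNSETS_OF = {"apple": ["apple.fruit", "apple.tree"], "orange": ["orange.fruit"]}
HYPERNYMS_OF = {
    "apple.fruit": ["edible_fruit"],
    "orange.fruit": ["edible_fruit"],
    "apple.tree": ["fruit_tree"],
}

print(wordnet_signature({"apple": 2, "orange": 1}, SYNSETS_OF, HYPERNYMS_OF,
                        hypernym_level=1, hypernym_factor=2.0))
```

With a hypernym factor of 1 the hypernyms are not damped at all, so shared ancestors contribute as much as the synsets themselves.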

Another limitation is that a portion of the nouns in the source texts
cannot be located in WordNet (see Figure 6). Thus, we also tried a
variation (algorithm 3+2) that falls back to the bare lexical form of
a noun if it cannot be found in WordNet. This variation, however,
resulted in a slight decrease in performance.

2.1.4  Algorithm  4:  Disambiguated  signature 

The problem of having multiple senses for each word calls for the
adoption of word sense disambiguation techniques. Thus, we
implemented a word sense disambiguation algorithm, and we inserted it
into the signature generation pipeline (Figure 5). For each noun in
the input documents, the disambiguator takes into account a specified
number of context words, i.e., nouns preceding and/or following the
target word. The algorithm computes a measure of the semantic
distance between the possible senses of the target word and the
senses of each of its context words, pairwise. A sense for the target
word is chosen such that the total distance to its context is
minimized. The semantic distance between two synsets is defined here
as the minimum number of hops in the WordNet hypernym hierarchy
connecting the two synsets. This definition allows for a relatively
straightforward computation of the semantic distance using WordNet.
Other, more sophisticated definitions of semantic distance can be
found in (Patwardhan et al., 2003). The word sense disambiguation
algorithm we implemented is certainly simpler than others proposed in
the literature, but we used it to see whether a method that is
relatively simple to implement could still help.
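A rough Python sketch of this kind of disambiguator is given below. The hop distance is computed with a breadth-first search over hypernym links treated as undirected edges. How distances to the several senses of one context word are aggregated is not specified in the text, so the sketch assumes that the closest sense of each context word is used. All names and the miniature hierarchy are hypothetical.

```python
from collections import deque


def build_neighbours(hypernyms_of):
    """Turn child -> parents links into an undirected adjacency map."""
    neighbours = {}
    for child, parents in hypernyms_of.items():
        for parent in parents:
            neighbours.setdefault(child, set()).add(parent)
            neighbours.setdefault(parent, set()).add(child)
    return neighbours


def hop_distance(a, b, neighbours):
    """Minimum number of hypernym links connecting synsets a and b."""
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")  # the two synsets are not connected


def disambiguate(target, context, synsets_of, neighbours):
    """Pick the sense of target with the smallest total distance to context."""
    senses = synsets_of.get(target, [])
    if not senses:
        return None

    def total_distance(sense):
        total = 0.0
        for word in context:
            candidates = synsets_of.get(word, [])
            if candidates:
                # Assumption: use the closest sense of each context word.
                total += min(hop_distance(sense, c, neighbours) for c in candidates)
        return total

    # min() keeps the first sense on ties, so an empty context
    # yields the first listed sense.
    return min(senses, key=total_distance)


# Hypothetical miniature lexicon and hierarchy.
SENSES = {"apple": ["apple.fruit", "apple.tree"],
          "orange": ["orange.fruit"], "oak": ["oak.tree"]}
HIERARCHY = build_neighbours({
    "apple.fruit": ["edible_fruit"], "orange.fruit": ["edible_fruit"],
    "apple.tree": ["tree"], "oak.tree": ["tree"],
    "edible_fruit": ["entity"], "tree": ["entity"],
})

print(disambiguate("apple", ["orange"], SENSES, HIERARCHY))
print(disambiguate("apple", ["oak"], SENSES, HIERARCHY))
```

In this toy hierarchy, the context word "orange" pulls "apple" towards its fruit sense, while "oak" pulls it towards its tree sense.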

The overall parameters for this signature creation algorithm are the
same as for the WordNet signature algorithm, plus two additional
parameters for the word sense disambiguator: left context length and
right context length. They represent, respectively, how many nouns
before and after the target should be taken into account by the
disambiguator. If those two parameters are both set to zero, then no
context is provided, and the first possible sense is chosen. Notice
that even in this case the behaviour of this signature generation
algorithm is different from the previous one. In a WordNet signature,
every possible sense for a word is inserted, whereas in a WordNet
disambiguated signature only one sense is added.

3  Experimental  setting 

All the algorithms described in the previous section have been fully
implemented in a coherent and extensible framework using the Java
programming language, and evaluation experiments have been run. This
section describes how the experiments have been conducted.

3.1  Test  data 

The evaluation of ontology matching approaches is usually made
difficult by the scarcity of test ontologies readily available in the
community. This problem is even worse for instance based approaches,
because the test ontologies also need to be “filled” with instance
documents. Also, we wanted to test our algorithms with “real world”
data, rather than toy examples.

We were able to collect suitable test data starting from the
ontologies published by the Ontology Alignment Evaluation Initiative
2005 (Euzenat et al., 2005). A section of their data contained an OWL
representation of fragments of the Google, Yahoo, and LookSmart web
directories. We “reverse engineered” some of these fragments, in
order to reconstruct two consistent trees, one representing part of
the Google directory structure, the other representing part of the
LookSmart hierarchy. The leaf nodes of these trees were filled with
instances downloaded from the web pages classified by the appropriate
directories. With this method, we were able to fill 7 nodes of each
ontology with 10 documents per node, for a total of 140 documents.
Each document came from a distinct web page, so there was no overlap
in the data to be compared. A graphical representation of our two
test ontologies, source and target, is shown in Figure 6. The darker
outlined nodes are those filled with instance documents. For the sake
of readability, the names of the nodes corresponding to real matches
are the same. Of course, this information is not used by our
algorithms, which adopt a purely instance based approach. Figure 6
also reports the size of the instance documents associated with each
node: total number of words, noun tokens, nouns, and nouns covered by
WordNet.

3.2  Parameters 

The experiments have been run with several combinations of the
relevant parameters: number of instance documents per node (5 or 10),
algorithm (1 to 4), extracted parts of speech (nouns, verbs, or
both), hypernym level (an integer greater than or equal to zero),
hypernym factor (a real number), and context length (an integer
greater than or equal to zero). Not all of the parameters are
applicable to every algorithm. The total number of runs was 90.

4  Results 

Each run of the system with our test ontologies produced a set of 49
values, representing the matching score of every pair of nodes
containing instances across the two ontologies. Selected examples of
these results are shown in Tables 1, 2, 3, and 4. In the experiments
shown in those tables, 10 instance documents for each node were used
to compute the signatures. Nodes that actually match (identified by
the same label, e.g., “Canada” and “Canada”) should show high
similarity scores, whereas nodes that do not match (e.g., “Canada”
and “Dendrochronology”) should have low scores. Better algorithms
would have higher scores for matching nodes, and lower scores for
non-matching ones. Notice that the two nodes “Egypt” and “Pyramid
Theories,” although intuitively related, have documents that take
different perspectives on the subject. So, the algorithms correctly
identify the nodes as being different.

Looking at the results in this form makes it difficult to precisely
assess the quality of the algorithms. To do so, a statistical
analysis has to be performed. For each table of results, let us
partition the scores into two distinct sets:

A = {simil(node_i, node_j) | real match = true}
B = {simil(node_i, node_j) | real match = false}


(rows: source node; columns: target node)

Source node       Canada  Dendrochronology  Megaliths  Museums  Nazca Lines  Pyramid Theories  United Kingdom
Canada            0.95    0.89              0.89       0.91     0.87         0.86              0.92
Dendrochronology  0.90    0.97              0.91       0.90     0.88         0.87              0.92
Egypt             0.86    0.89              0.91       0.87     0.86         0.88              0.90
Megaliths         0.90    0.91              0.99       0.93     0.95         0.94              0.93
Museums           0.89    0.88              0.90       0.93     0.88         0.87              0.90
Nazca Lines       0.88    0.88              0.95       0.91     0.99         0.93              0.91
United Kingdom    0.87    0.87              0.86       0.88     0.82         0.82              0.96

Table 1: Results - Baseline signature algorithm


(rows: source node; columns: target node)

Source node       Canada  Dendrochronology  Megaliths  Museums  Nazca Lines  Pyramid Theories  United Kingdom
Canada            0.67    0.20              0.14       0.35     0.08         0.08              0.41
Dendrochronology  0.22    0.80              0.15       0.22     0.09         0.09              0.25
Egypt             0.13    0.23              0.26       0.22     0.17         0.24              0.25
Megaliths         0.28    0.20              0.85       0.37     0.22         0.27              0.33
Museums           0.30    0.19              0.18       0.58     0.08         0.14              0.27
Nazca Lines       0.13    0.12              0.26       0.18     0.96         0.14              0.17
United Kingdom    0.42    0.20              0.17       0.26     0.09         0.11              0.80

Table 2: Results - Noun signature algorithm


(rows: source node; columns: target node)

Source node       Canada  Dendrochronology  Megaliths  Museums  Nazca Lines  Pyramid Theories  United Kingdom
Canada            0.79    0.19              0.19       0.38     0.15         0.06              0.56
Dendrochronology  0.26    0.83              0.18       0.20     0.16         0.07              0.24
Egypt             0.17    0.24              0.32       0.21     0.31         0.30              0.27
Megaliths         0.39    0.21              0.81       0.41     0.40         0.25              0.42
Museums           0.31    0.14              0.17       0.70     0.11         0.11              0.26
Nazca Lines       0.24    0.20              0.42       0.29     0.91         0.21              0.29
United Kingdom    0.56    0.17              0.22       0.25     0.15         0.08              0.84

Table 3: Results - WordNet signature algorithm (hypernym level=0)
Figure 6: Ontologies used in the experiments. The numbers below the
leaves indicate the size of instance documents: # of words; # of noun
tokens; # of nouns; # of nouns in WordNet


(rows: source node; columns: target node)

Source node       Canada  Dendrochronology  Megaliths  Museums  Nazca Lines  Pyramid Theories  United Kingdom
Canada            0.68    0.18              0.13       0.33     0.12         0.05              0.44
Dendrochronology  0.23    0.79              0.15       0.20     0.14         0.07              0.23
Egypt             0.15    0.23              0.28       0.22     0.27         0.31              0.27
Megaliths         0.30    0.18              0.84       0.37     0.34         0.27              0.33
Museums           0.29    0.16              0.15       0.60     0.11         0.10              0.24
Nazca Lines       0.20    0.17              0.38       0.26     0.89         0.21              0.26
United Kingdom    0.45    0.17              0.18       0.24     0.15         0.08              0.80

Table 4: Results - Disambiguated signature algorithm (hypernym
level=0, left context=1, right context=1)


With our test data, we have 6 values in set A and 43 values in set B.
Then, let us compute the average and standard deviation of the values
included in each set. The average of A represents the expected score
that the system would assign to a match; likewise, the average of B
is the expected score of a non-match. We define the following measure
to compare the performance of our matching algorithms, inspired by
“effect size” from (VanLehn et al., 2005):


$$\text{discrimination size} = \frac{\mathrm{avg}(A) - \mathrm{avg}(B)}{\mathrm{stdev}(A) + \mathrm{stdev}(B)}$$


Higher discrimination values mean that the scores assigned to matches
and non-matches are farther apart, making it possible to use those
scores to make more reliable decisions about the matching degree of
pairs of nodes.
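As a small illustration, the measure can be computed as in the Python sketch below. The scores are assumed to be stored in a dictionary keyed by (source node, target node) pairs, and the sample standard deviation is used; both choices, as well as the function and variable names, are assumptions of this sketch, since the text does not state which estimator was used.

```python
from statistics import mean, stdev


def discrimination_size(scores, is_match):
    """(avg(A) - avg(B)) / (stdev(A) + stdev(B)) for a table of scores.

    scores   : dict mapping (source_node, target_node) -> similarity
    is_match : function deciding whether a pair is a real match
    """
    a = [v for pair, v in scores.items() if is_match(*pair)]      # set A
    b = [v for pair, v in scores.items() if not is_match(*pair)]  # set B
    # Assumption: sample standard deviation (needs >= 2 values per set).
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b))


# Hypothetical scores for a 2 x 2 table; real matches share a label.
toy_scores = {
    ("Canada", "Canada"): 0.9, ("Canada", "Egypt"): 0.2,
    ("Egypt", "Egypt"): 0.7, ("Egypt", "Canada"): 0.4,
}
print(discrimination_size(toy_scores, lambda src, tgt: src == tgt))
```

The measure grows when the two averages move apart or when the scores within each set become less spread out.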


Table 5 shows the values of discrimination size (last column) for
selected results from our experiments. The algorithm used is reported
in the first column, and the values of the other relevant parameters
are indicated in the other columns. We can make the following
observations.

•  Algorithms 2, 3, and 4 generally outperform the baseline
(algorithm 1).

•  Algorithm 2 (Noun signature), which still uses a fairly simple and
purely syntactic technique, shows a substantial improvement.
Algorithm 3 (WordNet signature), which introduces some additional
level of semantics, performs even better.

•  In algorithms 3 and 4, hypernym expansion appears detrimental to
performance. In fact, the best results are obtained with hypernym
level equal to zero (no hypernym expansion).

•  The word sense disambiguator implemented in algorithm 4 does not
help. Even though disambiguating with some limited context (1 word
before and 1 word after) provides slightly better results than
choosing the first available sense for a word (context length equal
to zero), the overall results are worse than adding all the possible
senses to the signature (algorithm 3).

•  Using only 5 documents per node significantly degrades the
performance of all the algorithms (see the last 5 lines of the
table).

Alg  Docs  POS    Hyp lev  Hyp fac  L cont  R cont  Avg(A)  Stdev(A)  Avg(B)  Stdev(B)  Discrimination size
1    10    -      -        -        -       -       0.96    0.02      0.89    0.03      1.37
2    10    noun   -        -        -       -       0.78    0.13      0.21    0.09      2.55
2    10    verb   -        -        -       -       0.64    0.20      0.31    0.11      1.04
2    10    nn+vb  -        -        -       -       0.77    0.14      0.21    0.09      2.48
3    10    noun   0        -        -       -       0.81    0.07      0.25    0.12      3.08
3    10    noun   1        1        -       -       0.85    0.07      0.41    0.12      2.35
3    10    noun   1        2        -       -       0.84    0.07      0.34    0.12      2.64
3    10    noun   1        3        -       -       0.83    0.07      0.31    0.12      2.80
3    10    noun   2        1        -       -       0.90    0.06      0.62    0.11      1.64
3    10    noun   2        2        -       -       0.86    0.07      0.45    0.12      2.18
3    10    noun   2        3        -       -       0.84    0.07      0.36    0.12      2.56
3    10    noun   3        1        -       -       0.95    0.04      0.78    0.08      1.44
3    10    noun   3        2        -       -       0.88    0.07      0.52    0.12      1.91
3    10    noun   3        3        -       -       0.85    0.07      0.38    0.12      2.45
3+2  10    noun   0        0        -       -       0.80    0.09      0.21    0.11      2.94
3+2  10    noun   1        2        -       -       0.83    0.08      0.30    0.11      2.73
3+2  10    noun   2        2        -       -       0.85    0.08      0.39    0.11      2.40
4    10    noun   0        -        0       0       0.80    0.12      0.24    0.10      2.64
4    10    noun   0        -        1       1       0.77    0.11      0.22    0.10      2.67
4    10    noun   0        -        2       2       0.77    0.11      0.23    0.10      2.59
4    10    noun   1        2        0       0       0.82    0.10      0.29    0.10      2.56
4    10    noun   1        2        1       1       0.80    0.10      0.34    0.10      2.27
4    10    noun   1        2        2       2       0.80    0.10      0.35    0.10      2.22
1    5     noun   -        -        -       -       0.93    0.05      0.86    0.04      0.88
2    5     noun   -        -        -       -       0.66    0.23      0.17    0.08      1.61
3    5     noun   0        -        -       -       0.70    0.17      0.21    0.11      1.76
4    5     noun   0        -        0       0       0.69    0.21      0.20    0.09      1.63
4    5     noun   0        -        1       1       0.64    0.21      0.18    0.08      1.58

Table 5: Results - Discrimination size

5  Conclusions  and  future  work

The results of our experiments point out several research questions
and directions for future work, some more specific and some more
general. As regards the more specific issues,

•  Algorithm 2 does not perform morphological processing, whereas
Algorithm 3 does. How much of the improved effectiveness of Algorithm
3 is due to this fact? To answer this question, Algorithm 2 could be
enhanced to include a morphological processor.

•  The effectiveness of Algorithms 3 and 4 may be hindered by the
fact that many words are not yet included in the WordNet database
(see Figure 6). Falling back to Algorithm 2 proved not to be a
solution. The impact of the incompleteness of the lexical resource
should be investigated and assessed more precisely. Another avenue of
research may be to exploit different thesauri, such as the ones
automatically derived as in (Curran and Moens, 2002).

•  The performance of Algorithm 4 might be improved by using more
sophisticated word sense disambiguation methods. It would also be
interesting to explore the application of the unsupervised method
described in (McCarthy et al., 2004).


As  regards  our  long  term  plans,  first,  structural 
properties  of  the  ontologies  could  potentially  be 
exploited  for  the  computation  of  node  signatures. 
This  kind  of  enhancement  would  make  our  system 
move  from  a  purely  instance  based  approach  to  a 
combined  hybrid  approach  based  on  schema  and 
instances. 

More fundamentally, we need to address the lack of appropriate,
domain specific resources that can support the training of algorithms
and models appropriate for the task at hand. WordNet is a very
general lexicon that does not support domain specific vocabulary,
such as that used in geosciences or in medicine, or simply that
contained in a sub-ontology that users may define according to their
interests. Of course, we do not want to develop by hand domain
specific resources that we have to change each time a new domain
arises.

The crucial research issue is how to exploit extremely scarce
resources to build efficient and effective models. The scarcity of
resources makes it impossible to use methods that are successful at
discriminating documents based on the words they contain but that
need large corpora for training, for example Latent Semantic Analysis
(Landauer et al., 1998). The experiments described in this paper
could be seen as providing a bootstrapped model (Riloff and Jones,
1999; Ng and Cardie, 2003); in ML, bootstrapping requires seeding the
classifier with a small number of well chosen target examples. We
could develop a web spider, based on the work described in this
paper, to automatically retrieve larger amounts of training and test
data, which in turn could be processed with more sophisticated NLP
techniques.